Modeling

Reporting, summarising and communicating models in R


Last session we talked a bit about how you might scrape data from the web. Collecting data is not an end in itself, but rather the first step towards answering your research question. When it comes to analyzing trends, testing hypotheses, and predicting outcomes, modelling enters the picture.

In this session, we won’t focus on the statistical considerations that inform your model choice, but rather show you how to:

  • Use formulas to specify multiple models
  • Process (estimate) model results with broom
  • Summarize outputs with modelsummary
  • Communicate results through plots and tables

You will have full courses dedicated to thinking about modelling choices, such as Causal Inference and Machine Learning.


The Model Workflow with R

While the last session dealt with one of the flashier aspects of data science (Web-Scraping), today we will return to the bread and butter 🍞 of any dedicated data scientist’s tool kit - modelling!

But if we aren’t even talking about Machine Learning, Deep Learning or model choice, why bother? Well, as with most things, it is worth revisiting fundamentals from time to time. By investing just a little more effort here, we can create much better reports and papers!

In general, remember that your basic workflow for evaluating and reporting models is as follows:

Today we will mostly deal with the “Model” and “Communicate” steps shown in the graph.


Setup

Before we start coding, we need to load several packages as well as our data. We’ll be using data from the World Health Organization on life expectancy.

# load packages 
pacman::p_load(kableExtra, # for nicely outputted HTML tables
               tidyverse, # for the suite of tidy packages
               broom, # to extract key information from statistical models into tidy data 
               modelsummary, # flexible model output tables 
               specr, # to conduct specification curve analyses
               janitor, # to clean up our data
               modelr, # tidy syntax modelling
               wesanderson) # color palette for ggplot2

# load data
life_expec_dat <- readr::read_csv("life_expectancy.csv") |> 
  janitor::clean_names() |> 
  dplyr::mutate(large = if_else(population > 50000000, 1, 0))  # pop size > 50M

head(life_expec_dat)
## # A tibble: 6 × 23
##   country      year status life_expectancy adult_mortality infant_deaths alcohol
##   <chr>       <dbl> <chr>            <dbl>           <dbl>         <dbl>   <dbl>
## 1 Afghanistan  2015 Devel…            65               263            62    0.01
## 2 Afghanistan  2014 Devel…            59.9             271            64    0.01
## 3 Afghanistan  2013 Devel…            59.9             268            66    0.01
## 4 Afghanistan  2012 Devel…            59.5             272            69    0.01
## 5 Afghanistan  2011 Devel…            59.2             275            71    0.01
## 6 Afghanistan  2010 Devel…            58.8             279            74    0.01
## # ℹ 16 more variables: percentage_expenditure <dbl>, hepatitis_b <dbl>,
## #   measles <dbl>, bmi <dbl>, under_five_deaths <dbl>, polio <dbl>,
## #   total_expenditure <dbl>, diphtheria <dbl>, hiv_aids <dbl>, gdp <dbl>,
## #   population <dbl>, thinness_1_19_years <dbl>, thinness_5_9_years <dbl>,
## #   income_composition_of_resources <dbl>, schooling <dbl>, large <dbl>

Using Formulas in R 👨‍🏫

R, as a programming language, was specifically designed with statistical analysis in mind. Therefore the founding fathers of R 🙌 - in their infinite wisdom - decided to create a special object class (called formula) to help with running models.


Syntax

A quick recap of the syntax used within formulas. The ~ (tilde) sign generally differentiates between dependent and independent variables. The + sign is used to distinguish between independent variables. Here are a few more common formula inputs that are useful to know:

y ~ x1 + x2 # standard formula

y ~ . # include all other columns in data as x vars

y ~ x1 * x2 # interaction terms

y ~ x + I(x^2) # higher order terms

Formulas are generally straightforward to work with. To avoid any potential errors, you should follow these two steps:

Step 1: Create a string containing the written formula.

formula_string <- paste("life_expectancy", "~ gdp")
formula_string
## [1] "life_expectancy ~ gdp"

Step 2: Transform the string into R’s formula class. Only then will your modeling function accept it as input.

form <- as.formula(formula_string) # needs to be transformed into formula class
form
## life_expectancy ~ gdp

Here is the proof that both ways of doing it are identical.

reg1 <- lm(form, data = life_expec_dat) 
reg2 <- lm(life_expectancy ~ gdp, data = life_expec_dat)

reg1$coefficients == reg2$coefficients
## (Intercept)         gdp 
##        TRUE        TRUE
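As an aside, base R also ships `reformulate()`, which builds the same formula object directly from character vectors and skips the `paste()`/`as.formula()` two-step entirely:

```r
# reformulate() takes term labels and a response and returns a formula object
form2 <- reformulate(termlabels = c("gdp", "alcohol"), response = "life_expectancy")
form2
## life_expectancy ~ gdp + alcohol
```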

Iteration

In instances where you might need to fit many models, formulas can help with making this process iterative.

Often, the modelling process requires you to run the same specification with multiple configurations of both dependent and independent variables. Model formulas make running many similar models much easier:

Step 1: Define a function that lets you plug in different variables for x or y. The function should take the x and/or y variable(s) as a string and return a model object.

# function to include different y variables
lm_fun_iter_y <-  function(dep_var) {
  lm(as.formula(paste(dep_var, "~ gdp")), data = life_expec_dat)
}

# function to include different x variables 
lm_fun_iter_x <- function(indep_var) {
  lm(as.formula(paste("life_expectancy ~", paste(indep_var, collapse = "+"))), data = life_expec_dat)
}

Notice: Since your model will likely include many independent variables, you will rarely run a simple bivariate regression. We therefore use a nested paste() call: the inner paste() joins all independent variables in a character vector with + (via its collapse argument), and the outer paste() prepends the dependent variable and the ~.
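To see what the nested paste() call produces, here is a quick sketch with a hypothetical vector of covariate names:

```r
# the inner paste() collapses the vector with "+", the outer one adds the left-hand side
indep_var <- c("gdp", "alcohol", "bmi")
paste("life_expectancy ~", paste(indep_var, collapse = "+"))
## [1] "life_expectancy ~ gdp+alcohol+bmi"
```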

Step 2: Use purrr::map() (refer to lab session 4) to iteratively apply the model to each variable contained in a character vector of variables:

# create vector of variables to iterate over
vars <- life_expec_dat |> 
  dplyr::select(-c("life_expectancy", "country")) |> 
  names()

vars
##  [1] "year"                            "status"                         
##  [3] "adult_mortality"                 "infant_deaths"                  
##  [5] "alcohol"                         "percentage_expenditure"         
##  [7] "hepatitis_b"                     "measles"                        
##  [9] "bmi"                             "under_five_deaths"              
## [11] "polio"                           "total_expenditure"              
## [13] "diphtheria"                      "hiv_aids"                       
## [15] "gdp"                             "population"                     
## [17] "thinness_1_19_years"             "thinness_5_9_years"             
## [19] "income_composition_of_resources" "schooling"                      
## [21] "large"
# run a bivariate model for each column
biv_model_out <-  vars |>
  purrr::map(lm_fun_iter_x)

biv_model_out
## [[1]]
## 
## Call:
## lm(formula = as.formula(paste("life_expectancy ~", paste(indep_var, 
##     collapse = "+"))), data = life_expec_dat)
## 
## Coefficients:
## (Intercept)         year  
##    -635.872        0.351  
## 
## 
## [[2]]
## 
## Call:
## lm(formula = as.formula(paste("life_expectancy ~", paste(indep_var, 
##     collapse = "+"))), data = life_expec_dat)
## 
## Coefficients:
##      (Intercept)  statusDeveloping  
##             79.2             -12.1  
....

This returns output from all of the models that you’ve just run. Given that there are now quite a few models, it might be a good idea to keep the names of the independent variables that you feed into the model. You can do this with purrr::set_names().

# run a bivariate model for each column
biv_model_out_w_names <-  vars |>
  purrr::set_names() |> 
  purrr::map(lm_fun_iter_x)

biv_model_out_w_names$year # we can now easily index independent vars
## 
## Call:
## lm(formula = as.formula(paste("life_expectancy ~", paste(indep_var, 
##     collapse = "+"))), data = life_expec_dat)
## 
## Coefficients:
## (Intercept)         year  
##    -635.872        0.351

Say that you’re interested in seeing what effect gdp has on life_expectancy, but you suspect that other covariates such as alcohol consumption, health care expenditure, average body mass index and the prevalence of HIV/AIDS also have an effect.

On top of this, it’s possible that these variables might have different effects on life_expectancy depending on how they interact with each other. We can determine which independent variables are robustly associated with the dependent variable by generating models for every possible combination of independent variables and then looking into the distribution of effects across these models.

Let’s review the code introduced in the lecture:

Step 1: Find out how many possible combinations of independent and dependent variables exist:

combinations <- 
  purrr::map(1:5, function(x) {combn(1:5, x)}) |> # create all possible combinations (assume we have 5 covariates)
  purrr::map(ncol) |> # extract number of combinations
  unlist() |> # pull out of list structure
  sum() # compute sum

combinations
## [1] 31
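As a sanity check, the same number falls out of a direct computation: each of the 5 covariates is either in or out of a model, minus the empty model:

```r
# all non-empty subsets of 5 covariates
sum(choose(5, 1:5))
## [1] 31
2^5 - 1  # equivalent shortcut
## [1] 31
```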

Step 2: Write a function that can run all possible combinations in one go:

combn_models <- function(depvar, covars, data) {
  
  # initialize empty list
  combn_list <- list()
  
  # generate list of covariate combinations
  for (i in seq_along(covars)) {
    combn_list[[i]] <- combn(covars, i, simplify = FALSE)
  } # what is combn?
  
  # unlist to make object easier to work with
  combn_list <-
    unlist(combn_list, recursive = FALSE) 
  
  # function to generate formulas
  gen_formula <- function(covars, depvar) {
    form <-
      as.formula(paste0(depvar, " ~ ", paste0(covars, collapse = "+")))
    form
  }
  
  # generate formulas
  formulas_list <-
    purrr::map(combn_list, gen_formula, depvar = depvar)
  
  # run models
  models_list <- purrr::map(formulas_list, lm, data = data)
  models_list
}
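If the comment above left you wondering: combn(x, m) generates all combinations of the elements of x taken m at a time, and simplify = FALSE returns them as a list rather than a matrix:

```r
# all pairs from a vector of three covariate names, returned as a list
combn(c("gdp", "alcohol", "bmi"), 2, simplify = FALSE)
## [[1]]
## [1] "gdp"     "alcohol"
## 
## [[2]]
## [1] "gdp" "bmi"
## 
## [[3]]
## [1] "alcohol" "bmi"
```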

Step 3: Apply the function to our case:

depvar <- "life_expectancy"
covars <-  c("gdp", "percentage_expenditure", "alcohol", "bmi", "hiv_aids")
multiv_model_out <- combn_models(depvar = depvar, covars = covars, data = life_expec_dat)

# check whether we have the correct number of models
length(multiv_model_out)
## [1] 31

As you can see, the function introduced in the lecture is quite powerful, but it is by no means the only option for testing multiple model specifications. Depending on the type of analysis that you plan to perform, certain R packages enable you to run many different combinations of regression models with just a few lines of code. For example, we can perform multiple estimations with the fixest package.


Specification Curve Analysis ⤴️

In the example above, we consciously chose which models to run. Ideally, this decision should be grounded in theory and knowledge of the subject matter. However, some researchers argue that such choices are inherently arbitrary, potentially resulting in biased model specifications. They propose a more analytical approach to model specification, which they refer to as Specification Curve Analysis, or Multiverse Analysis.

We can implement this approach using the specr package. In simple terms, it works as follows:

Step 1: Specify a list of reasonable “model ingredients” for each model parameter:

specs <- specr::setup(
  data = life_expec_dat, 
  y = c("life_expectancy"), # add dependent variable 
  x = c("gdp", "log(gdp)", "I(gdp^2)"), # add independent variables 
  model = c("lm", "glm"), # specify model types
  controls = c("population", "alcohol", "bmi"), # specify controls
  subsets = as.list(distinct(life_expec_dat, status)) # specify subgroup
)

summary(specs, rows = 20)
## Setup for the Specification Curve Analysis
## -------------------------------------------
## Class:                      specr.setup -- version: 1.0.0 
## Number of specifications:   144 
## 
## Specifications:
## 
##   Independent variable:     gdp, log(gdp), I(gdp^2) 
##   Dependent variable:       life_expectancy 
##   Models:                   lm, glm 
##   Covariates:               no covariates, population, alcohol, bmi, population + alcohol, population + bmi, alcohol + bmi, population + alcohol + bmi 
##   Subsets analyses:         Developing, Developed, all 
## 
## Function used to extract parameters:
## 
##   function (x) 
## broom::tidy(x, conf.int = TRUE)
## <environment: 0x7ff3c160c568>
## 
## 
## Head of specifications table (first 20 rows):
## # A tibble: 20 × 7
##    x     y               model controls             subsets    status    formula
##    <chr> <chr>           <chr> <chr>                <chr>      <fct>     <glue> 
##  1 gdp   life_expectancy lm    no covariates        Developing Developi… life_e…
##  2 gdp   life_expectancy lm    no covariates        Developed  Developed life_e…
##  3 gdp   life_expectancy lm    no covariates        all        <NA>      life_e…
##  4 gdp   life_expectancy lm    population           Developing Developi… life_e…
##  5 gdp   life_expectancy lm    population           Developed  Developed life_e…
##  6 gdp   life_expectancy lm    population           all        <NA>      life_e…
##  7 gdp   life_expectancy lm    alcohol              Developing Developi… life_e…
##  8 gdp   life_expectancy lm    alcohol              Developed  Developed life_e…
##  9 gdp   life_expectancy lm    alcohol              all        <NA>      life_e…
## 10 gdp   life_expectancy lm    bmi                  Developing Developi… life_e…
## 11 gdp   life_expectancy lm    bmi                  Developed  Developed life_e…
## 12 gdp   life_expectancy lm    bmi                  all        <NA>      life_e…
## 13 gdp   life_expectancy lm    population + alcohol Developing Developi… life_e…
## 14 gdp   life_expectancy lm    population + alcohol Developed  Developed life_e…
## 15 gdp   life_expectancy lm    population + alcohol all        <NA>      life_e…
## 16 gdp   life_expectancy lm    population + bmi     Developing Developi… life_e…
## 17 gdp   life_expectancy lm    population + bmi     Developed  Developed life_e…
## 18 gdp   life_expectancy lm    population + bmi     all        <NA>      life_e…
## 19 gdp   life_expectancy lm    alcohol + bmi        Developing Developi… life_e…
## 20 gdp   life_expectancy lm    alcohol + bmi        Developed  Developed life_e…

Step 2: Run Specification Curve Analysis:

# run specification curve analysis
results <- specr::specr(specs)

Step 3: Inspect the specification curve to understand how robust your findings are to different analytical choices.

These plots help to illustrate where and when you had to make a choice and how many models you might alternatively have specified.

plot(specs, circular = TRUE)

More concretely, you could also look at some summary statistics for your coefficient estimates across the different model specifications:

summary(results, 
        type = "curve",
        group = c("x", "controls")  # group analysis by model choices
)
## # A tibble: 24 × 9
## # Groups:   x [3]
##    x        controls     median      mad      min     max      q25     q75   obs
##    <chr>    <chr>         <dbl>    <dbl>    <dbl>   <dbl>    <dbl>   <dbl> <dbl>
##  1 I(gdp^2) alcohol     2.31e-9 2.47e- 9 6.45e-10 4.38e-9 1.06e- 9 3.86e-9  1905
##  2 I(gdp^2) alcohol + … 1.77e-9 5.33e-10 6.45e-10 2.13e-9 9.26e-10 2.04e-9  1890
##  3 I(gdp^2) bmi         2.08e-9 8.32e-11 6.37e-10 2.13e-9 9.96e-10 2.12e-9  2013
##  4 I(gdp^2) no covaria… 3.12e-9 2.17e- 9 6.41e-10 4.58e-9 1.26e- 9 4.22e-9  2037
##  5 I(gdp^2) population  3.30e-9 3.94e- 9 6.41e-10 1.15e-8 1.31e- 9 9.42e-9  1846
##  6 I(gdp^2) population… 2.09e-9 2.12e- 9 6.59e-10 9.98e-9 1.02e- 9 8.01e-9  1726
##  7 I(gdp^2) population… 1.75e-9 1.63e- 9 6.56e-10 6.57e-9 9.30e-10 5.37e-9  1711
##  8 I(gdp^2) population… 2.23e-9 2.35e- 9 6.38e-10 6.63e-9 1.03e- 9 5.53e-9  1822
##  9 gdp      alcohol     2.50e-4 1.98e- 4 6.26e- 5 3.83e-4 1.09e- 4 3.50e-4  1905
## 10 gdp      alcohol + … 1.87e-4 4.95e- 5 6.26e- 5 2.21e-4 9.38e- 5 2.12e-4  1890
## # ℹ 14 more rows

Note that the statistic mad stands for the median absolute deviation.
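If that statistic is new to you, base R's mad() shows why it is reported here: unlike the standard deviation, it is robust to outliers (by default mad() scales by the constant 1.4826 so that it is comparable to the standard deviation under normality). A toy example with made-up numbers:

```r
x <- c(1, 2, 3, 4, 100)  # one extreme value
mad(x)  # barely affected by the outlier
## [1] 1.4826
sd(x)   # roughly 43.6 - blown up by the outlier
```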

A more intuitive way of inspecting the results might be to visualize them.

# plot entire visualization
plot(results)

Plot A denotes:

  • X-axis = ordered model specifications (from most negative effect to most positive effect)
  • Y-axis = standardized regression coefficient for the IV-DV relationship
  • Points = the standardized regression coefficient for each specific model
  • Error bars = 95% confidence intervals around the point estimate

Plot B denotes:

  • X-axis = ordered model specifications (the same as panel A)
  • Y-axis (right) = analytic decision categories
  • Y-axis (left) = specific analytic decisions within each category
  • Lines = indicate whether a specific analytic decision applied to that particular model specification

Interpretable Effect Sizes 🔎

While these plots make it easy to compare effects across coefficients, unit changes may not be comparable across variables. We can address this problem in several ways:

  • Re-scale variables to show intuitive unit changes in X (e.g., $1k instead of $1)
  • Re-scale to full scale (minimum to maximum) changes in X
  • Standardize variables to show standard deviation changes in X

Let’s try standardizing our variables to address the problem.

# copy our data
life_expec_dat_stand <- life_expec_dat

# standardize  variables
life_expec_dat_stand <- life_expec_dat_stand |> 
  dplyr::mutate(gdp = effectsize::standardise(gdp))

# specify reasonable model "ingredients"
specs_stand <- specr::setup(
  data = life_expec_dat_stand, 
  y = c("life_expectancy"), # add dependent variable 
  x = c("gdp", "log(gdp)", "I(gdp^2)"), # add independent variables 
  model = c("lm", "glm"), # specify model types
  controls = c("population", "alcohol", "bmi"), # specify controls
  subsets = as.list(distinct(life_expec_dat_stand, status)) # specify subgroup
)

# run specification curve analysis
results_stand <- specr::specr(specs_stand)

# plot entire visualization
plot(results_stand) 

Note that any re-scaling operation will affect how you interpret the coefficients! For more details, check out the effectsize package.
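For intuition, standardizing just means centering and scaling: subtract the mean, divide by the standard deviation. A minimal sketch with made-up numbers (not the WHO data), confirming that base R's scale() does the same thing:

```r
x <- c(2, 4, 6, 8, 10)
z <- (x - mean(x)) / sd(x)          # standardize by hand
all.equal(z, as.numeric(scale(x)))  # scale() centers and scales identically
## [1] TRUE
```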

Without putting much thought into our model, it looks like gdp (depending on the specification) can have either a small or a substantial positive effect on life_expectancy. This just goes to show how carefully one has to approach the task of correctly specifying a model!

🚨 Watch Out! ⚠️

Be careful and selective when you use specification curve analysis! It is not a tool that you should apply to each and every potential paper, especially if you have not taken a course on causal inference. When a specification curve analysis includes variables that act as colliders or mediators, it might actually legitimize an otherwise wrongly specified model!


Model Output with broom 🧹

In the lecture we saw that lists of different combinations of model specifications can be unwieldy. Simon also pointed out that we are interested in three main aspects of our model outputs:

  1. Estimated coefficients and associated standard errors, t-statistics, p-values, confidence intervals
  2. Model summaries including goodness of fit measures, information on model convergence, number of observations used
  3. Observation-level information that arises from the estimated model, such as fitted/predicted values, residuals, estimates of influence

Extracting this information from the model output is possible but not straightforward:

str(multiv_model_out[[3]])
## List of 13
##  $ coefficients : Named num [1:2] 64.763 0.955
##   ..- attr(*, "names")= chr [1:2] "(Intercept)" "alcohol"
##  $ residuals    : Named num [1:2735] 0.227 -4.873 -4.873 -5.273 -5.573 ...
##   ..- attr(*, "names")= chr [1:2735] "1" "2" "3" "4" ...
##  $ effects      : Named num [1:2735] -3617.34 202.2 -4.78 -5.18 -5.48 ...
##   ..- attr(*, "names")= chr [1:2735] "(Intercept)" "alcohol" "" "" ...
##  $ rank         : int 2
##  $ fitted.values: Named num [1:2735] 64.8 64.8 64.8 64.8 64.8 ...
##   ..- attr(*, "names")= chr [1:2735] "1" "2" "3" "4" ...
....
summary(multiv_model_out[[3]])
## 
## Call:
## .f(formula = .x[[i]], data = ..1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -33.96  -4.72   1.62   6.41  20.76 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
....

Tidying Model Objects

Luckily, we don’t have to deal with this. We can use the tidyverse’s broom package, which is part of tidymodels, a collection of modeling packages that adhere to tidyverse principles. Tidymodels offers a lot of cool packages that have their own tutorials, so check them out! 💪

In today’s session, we will only focus on broom. As a refresher, here are the three key broom functions that you should know:

  1. broom::tidy(): Summarizes information about model components.
broom::tidy(multiv_model_out[[3]], conf.int = TRUE, conf.level = 0.95)
## # A tibble: 2 × 7
##   term        estimate std.error statistic   p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
## 1 (Intercept)   64.8      0.253      256.  0           64.3       65.3 
## 2 alcohol        0.955    0.0412      23.1 2.11e-108    0.874      1.04
  2. broom::glance(): Reports information about the entire model.
broom::glance(multiv_model_out[[3]])
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic   p.value    df logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     0.164         0.164  8.73      536. 2.11e-108     1 -9807. 19621. 19639.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
  3. broom::augment(): Adds information about observations to a dataset.
broom::augment(multiv_model_out[[3]], se_fit = TRUE) 
## # A tibble: 2,735 × 10
##    .rownames life_expectancy alcohol .fitted .se.fit .resid     .hat .sigma
##    <chr>               <dbl>   <dbl>   <dbl>   <dbl>  <dbl>    <dbl>  <dbl>
##  1 1                    65      0.01    64.8   0.253  0.227 0.000838   8.74
##  2 2                    59.9    0.01    64.8   0.253 -4.87  0.000838   8.74
##  3 3                    59.9    0.01    64.8   0.253 -4.87  0.000838   8.74
##  4 4                    59.5    0.01    64.8   0.253 -5.27  0.000838   8.74
##  5 5                    59.2    0.01    64.8   0.253 -5.57  0.000838   8.74
##  6 6                    58.8    0.01    64.8   0.253 -5.97  0.000838   8.74
##  7 7                    58.6    0.01    64.8   0.253 -6.17  0.000838   8.74
##  8 8                    58.1    0.03    64.8   0.252 -6.69  0.000834   8.74
##  9 9                    57.5    0.02    64.8   0.253 -7.28  0.000836   8.74
## 10 10                   57.3    0.03    64.8   0.252 -7.49  0.000834   8.74
## # ℹ 2,725 more rows
## # ℹ 2 more variables: .cooksd <dbl>, .std.resid <dbl>

Doesn’t this look so much tidier?


Broom with Many Models

Where broom really shines is when it comes to dealing with multiple models like we have in our multiv_model_out. It allows us to easily move away from hard-to-deal-with lists while retaining each of the three important model outputs. Since our example replicates several variables in different models, we include the .id argument in purrr::map(). This adds a column to our tibble which will contain a model identifier. In our example, the identifier of each model corresponds with the index number of the model output.

multiv_model_out_broom <- purrr::map_dfr(multiv_model_out[1:3], # the model objects
                                         broom::tidy, # the function to iterate with
                                         .id = "model_type") # id column

multiv_model_out_broom
## # A tibble: 6 × 6
##   model_type term                    estimate std.error statistic   p.value
##   <chr>      <chr>                      <dbl>     <dbl>     <dbl>     <dbl>
## 1 1          (Intercept)            67.0      0.194         346.  0        
## 2 1          gdp                     0.000312 0.0000120      25.9 2.71e-131
## 3 2          (Intercept)            67.9      0.174         391.  0        
## 4 2          percentage_expenditure  0.00183  0.0000817      22.3 2.77e-102
## 5 3          (Intercept)            64.8      0.253         256.  0        
## 6 3          alcohol                 0.955    0.0412         23.1 2.11e-108

Now that we have multiple model outputs in a format that we can work with, it is time to start thinking about visualizing our results. You can plot the residuals, the goodness of fit statistics etc. Similarly, you could also use broom::glance() to provide some robustness statistics or descriptive tables.

The next section will focus on how to present your results!


Results with modelsummary 🥼

For this we will use the modelsummary package by Vincent Arel-Bundock. Though there are many packages to choose from when it comes to communicating your results, we highly recommend you give modelsummary a shot!

One reason that we like this package is that it combines both approaches of presenting model results, either as a table or as a coefficient plot. As we saw in the lecture, each method has its advantages and its drawbacks. Therefore, we will cover both here.


Using Regression Tables

The workhorse function in modelsummary is unsurprisingly modelsummary(). 🤔

Here we only need to input the models list as the function does all of the tidying for us.

modelsummary::modelsummary(multiv_model_out[[31]], 
                           output = "kableExtra") # can take multiple formats (e.g., .tex, .doc, .pdf)
Model 1
(Intercept) 61.451
(0.293)
gdp 0.000
(0.000)
percentage_expenditure 0.000
(0.000)
alcohol 0.394
(0.034)
bmi 0.167
(0.007)
hiv_aids −0.779
(0.023)
Num.Obs. 2313
R2 0.631
R2 Adj. 0.630
AIC 14787.7
BIC 14828.0
Log.Lik. −7386.873
RMSE 5.90

It looks pretty solid already, but can definitely be improved upon.

For example, we can set an acceptable number of digits:

modelsummary::modelsummary(multiv_model_out[[31]], 
                           output = "kableExtra",
                           fmt = "%.2f") # 2-digits and trailing 0

We can also report only the confidence intervals by getting rid of the p-values and the intercept term:

model_table <- modelsummary::modelsummary(multiv_model_out[[31]], 
                                          output = "kableExtra",
                                          fmt = "%.2f",  # 2-digits and trailing 0  
                                          estimate  = "{estimate}", 
                                          statistic = "conf.int",
                                          coef_omit = "Intercept") 

Wouldn’t it look more professional if our coefficients began with a capital letter? And how about we get rid of some of the bloated goodness-of-fit statistics?

mod_table <- modelsummary::modelsummary(multiv_model_out[31], 
                                        output = "default",
                                        fmt = "%.2f",  # 2-digits and trailing 0  
                                        estimate  = "{estimate}",
                                        statistic = "conf.int",
                                        coef_omit = "Intercept",
                                        coef_rename=c("gdp"="Gdp", 
                                                      "bmi"="Avg. BMI", 
                                                      "alcohol" = "Alcohol Consum.",
                                                      "hiv_aids"= "HIV cases", 
                                                      "percentage_expenditure" = "Health Expenditure (% of GDP)",
                                                      "life_expectancy" = "Life Expectancy"),
                                        gof_omit = 'DF|Deviance|Log.Lik|AIC|BIC',
                                        title = 'An Improved Regression Table')

mod_table
An Improved Regression Table
Model 1
Gdp 0.00
[0.00, 0.00]
Health Expenditure (% of GDP) −0.00
[−0.00, 0.00]
Alcohol Consum. 0.39
[0.33, 0.46]
Avg. BMI 0.17
[0.15, 0.18]
HIV cases −0.78
[−0.82, −0.73]
Num.Obs. 2313
R2 0.631
R2 Adj. 0.630
RMSE 5.90

When using the kableExtra package, you can even post-process your table:

mod_table |> 
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, fixed_thead = T) |> 
  kableExtra::row_spec(3, color = 'red') |>
  kableExtra::row_spec(5, background = 'lightblue')
An Improved Regression Table
Model 1
Gdp 0.00
[0.00, 0.00]
Health Expenditure (% of GDP) −0.00
[−0.00, 0.00]
Alcohol Consum. 0.39
[0.33, 0.46]
Avg. BMI 0.17
[0.15, 0.18]
HIV cases −0.78
[−0.82, −0.73]
Num.Obs. 2313
R2 0.631
R2 Adj. 0.630
RMSE 5.90

Using Plots 🎨

If you are looking for an alternative way to present your results, you might also consider a nice coefficient plot.

Again, modelsummary provides a very accessible function for this: modelplot(). As with the regression table, you input the model list, not the tidied output. The tidying happens under the hood.

# to avoid overplotting, we define a smaller list of models
models <- list(
  "Bivariate OLS" = lm(life_expectancy ~ gdp, data = life_expec_dat),
  "Multivariate OLS" = lm(life_expectancy ~ gdp + total_expenditure + population + alcohol + bmi, data = life_expec_dat),
  "Interaction" = lm(life_expectancy ~  gdp + total_expenditure + population + gdp*total_expenditure, data = life_expec_dat),
  "Polynomial" = lm(life_expectancy ~ gdp + I(gdp^2), data = life_expec_dat)
)

# plot 
modelsummary::modelplot(models, coef_omit = "Intercept") 

Now we just need to customize the plot to our liking:

modelsummary::modelplot(models, coef_omit = "Intercept") +
  labs(x = 'Coefficients', 
       y = 'Term names',
       title = 'Linear regression models of "Life expectancy"',
       caption = "Comparing multiple models, the data originated at the WHO") 

We will cover data visualization with R in much greater detail in the next session, so the section on coefficient plots should become even clearer then!


Appendix

library(tidyverse)
library(broom)
library(nycflights13)
library(modelsummary)
library(lme4)

Crafting formulas

The basic structure of a formula:

y ~ x

Running a model with a formula is straightforward. Note that we don’t even have to put the formula in parentheses:

lm(arr_delay ~ distance + origin, data = flights)
## 
## Call:
## lm(formula = arr_delay ~ distance + origin, data = flights)
## 
## Coefficients:
## (Intercept)     distance    originJFK    originLGA  
##    13.41405     -0.00405     -2.70426     -4.45617

A more explicit way to write formulas:

fmla <- as.formula("arr_delay ~ distance + origin")
class(fmla)
## [1] "formula"
lm(fmla, data = flights)
## 
## Call:
## lm(formula = fmla, data = flights)
## 
## Coefficients:
## (Intercept)     distance    originJFK    originLGA  
##    13.41405     -0.00405     -2.70426     -4.45617

More formula syntax:

y ~ x1 + x2
## y ~ x1 + x2
y ~ x1 - x2 # ignore x2 in analysis
## y ~ x1 - x2
y ~ . # select all other variables in model matrix
## y ~ .

Interactions:

y ~ x1 * x2 # That's equivalent to y ~ x1 + x2 + x1*x2
## y ~ x1 * x2
y ~ x1:x2  # only interaction, no main effects
## y ~ x1:x2

As-is variables:

y ~ x + I(x^2)
## y ~ x + I(x^2)

Extract all variable names from a formula object:

all.vars(fmla)
## [1] "arr_delay" "distance"  "origin"

Incrementally modify/update a formula:

#?update
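The commented-out help call above points to update(), which incrementally modifies an existing formula; the dots stand for "keep whatever is already there":

```r
fmla2 <- y ~ x1 + x2
update(fmla2, . ~ . + x3)  # add a term
## y ~ x1 + x2 + x3
update(fmla2, . ~ . - x2)  # drop a term
## y ~ x1
```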

Advanced crafting of formulas: fitting multiple models

Create a formula for a model with a large number of variables:

xnam <- paste0("x", 1:25)
(fmla <- as.formula(paste("y ~ ", paste(xnam, collapse= "+"))))
## y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 + 
##     x12 + x13 + x14 + x15 + x16 + x17 + x18 + x19 + x20 + x21 + 
##     x22 + x23 + x24 + x25

Create a multitude of covariate sets

Step 1: Define dependent variable + covariate set

depvar <- "arr_delay"
covars <- c("dep_delay", "carrier", "origin", "air_time", "distance", "hour")

Step 2: Build a function to run lm models across the set of all possible variable combinations

combn_models <- function(depvar, covars, data) {
  combn_list <- list()
  # generate list of covariate combinations
  for (i in seq_along(covars)) {
    combn_list[[i]] <- combn(covars, i, simplify = FALSE)
  }
  combn_list <- unlist(combn_list, recursive = FALSE)
  # function to generate formulas
  gen_formula <- function(covars, depvar) {
    form <- as.formula(paste0(depvar, " ~ ", paste0(covars, collapse = "+")))
    form
  }
  # generate formulas
  formulas_list <- purrr::map(combn_list, gen_formula, depvar = depvar)
  # run models
  models_list <- purrr::map(formulas_list, lm, data = data)
  models_list
}

Step 3: Run the models (careful, this will generate quite a large list)

models_list <- combn_models(depvar = depvar,  covars = covars, data = flights)
length(models_list)
## [1] 63

Working with tidy model outputs

Inspecting models in R is straightforward with the summary() function.

lm(arr_delay ~ distance + origin, data = flights) |> summary()
## 
## Call:
## lm(formula = arr_delay ~ distance + origin, data = flights)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -89.0  -24.0  -11.8    7.3 1281.4 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.41405    0.17481    76.7   <2e-16 ***
## distance    -0.00404    0.00011   -36.9   <2e-16 ***
## originJFK   -2.70425    0.18871   -14.3   <2e-16 ***
## originLGA   -4.45617    0.19351   -23.0   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 44 on 327342 degrees of freedom
##   (9430 observations deleted due to missingness)
## Multiple R-squared:  0.0055, Adjusted R-squared:  0.00549 
## F-statistic:  604 on 3 and 327342 DF,  p-value: <2e-16

But often you want to post-process estimation results. So let’s examine the output of a model function more closely.

model_out <- lm(arr_delay ~ distance + origin, data = flights)
#class(model_out)
#str(model_out)

Ugh, that’s a lot of information in a list structure. Let’s use some helper functions to unpack these:

#coef(model_out)
#fitted.values(model_out)
#residuals(model_out)
#model.matrix(model_out)
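Uncommenting any of these shows the corresponding component; `coef()`, for instance, returns the named vector of point estimates:

```r
coef(model_out)
## (Intercept)     distance    originJFK    originLGA 
##    13.41405     -0.00405     -2.70426     -4.45617
```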

We can also dig into the summary of the model.

#str(summary(model_out))
summary(model_out)$coefficients
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   13.414    0.17481      77  0.0e+00
## distance      -0.004    0.00011     -37 5.5e-297
## originJFK     -2.704    0.18871     -14  1.5e-46
## originLGA     -4.456    0.19351     -23 3.0e-117

The broom philosophy 🧹

The broom package takes the messy output of built-in R functions such as lm(), nls(), or t.test(), and turns them into tidy tibbles.

Example: linear model output

model_out <- lm(arr_delay ~ distance + origin, data = flights)
model_out
## 
## Call:
## lm(formula = arr_delay ~ distance + origin, data = flights)
## 
## Coefficients:
## (Intercept)     distance    originJFK    originLGA  
##    13.41405     -0.00405     -2.70426     -4.45617
summary(model_out)
## 
## Call:
## lm(formula = arr_delay ~ distance + origin, data = flights)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -89.0  -24.0  -11.8    7.3 1281.4 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.41405    0.17481    76.7   <2e-16 ***
## distance    -0.00404    0.00011   -36.9   <2e-16 ***
## originJFK   -2.70425    0.18871   -14.3   <2e-16 ***
## originLGA   -4.45617    0.19351   -23.0   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 44 on 327342 degrees of freedom
##   (9430 observations deleted due to missingness)
## Multiple R-squared:  0.0055, Adjusted R-squared:  0.00549 
## F-statistic:  604 on 3 and 327342 DF,  p-value: <2e-16

Examine model object:

#?tidy.lm
broom::tidy(model_out)
## # A tibble: 4 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept) 13.4      0.175         76.7 0        
## 2 distance    -0.00405  0.000110     -36.9 5.53e-297
## 3 originJFK   -2.70     0.189        -14.3 1.46e- 46
## 4 originLGA   -4.46     0.194        -23.0 3.04e-117
broom::tidy(model_out, conf.int = TRUE, conf.level = 0.95)
## # A tibble: 4 × 7
##   term        estimate std.error statistic   p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
## 1 (Intercept) 13.4      0.175         76.7 0         13.1      13.8    
## 2 distance    -0.00405  0.000110     -36.9 5.53e-297 -0.00426  -0.00383
## 3 originJFK   -2.70     0.189        -14.3 1.46e- 46 -3.07     -2.33   
## 4 originLGA   -4.46     0.194        -23.0 3.04e-117 -4.84     -4.08

Inspect summary statistics:

#?glance.lm
broom::glance(model_out)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic p.value    df    logLik     AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>     <dbl>   <dbl>  <dbl>
## 1   0.00550       0.00549  44.5      604.       0     3 -1706997.  3.41e6 3.41e6
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Add fitted values and residuals to original data:

#?augment.lm
broom::augment(model_out) |> slice(1:5)
## # A tibble: 5 × 10
##   .rownames arr_delay distance origin .fitted .resid       .hat .sigma   .cooksd
##   <chr>         <dbl>    <dbl> <chr>    <dbl>  <dbl>      <dbl>  <dbl>     <dbl>
## 1 1                11     1400 EWR       7.75   3.25 0.00000922   44.5   1.23e-8
## 2 2                20     1416 LGA       3.23  16.8  0.0000123    44.5   4.37e-7
## 3 3                33     1089 JFK       6.30  26.7  0.00000938   44.5   8.43e-7
## 4 4               -18     1576 JFK       4.33 -22.3  0.00000972   44.5   6.12e-7
## 5 5               -25      762 LGA       5.88 -30.9  0.00000989   44.5   1.19e-6
## # ℹ 1 more variable: .std.resid <dbl>

tidy() can also be applied to htest objects:

tt <- t.test(wt ~ am, mtcars)
tidy(tt)
## # A tibble: 1 × 10
##   estimate estimate1 estimate2 statistic    p.value parameter conf.low conf.high
##      <dbl>     <dbl>     <dbl>     <dbl>      <dbl>     <dbl>    <dbl>     <dbl>
## 1     1.36      3.77      2.41      5.49 0.00000627      29.2    0.853      1.86
## # ℹ 2 more variables: method <chr>, alternative <chr>

The true power of broom unfolds in settings where you want to combine results from multiple analyses (using subgroups of data, different models, bootstrap replicates of the original data frame, permutations, imputations, …).
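As a quick sketch of the subgroup case (illustrative only, assuming `tidyr` and `purrr` are available via the tidyverse): fit the same model separately for each carrier, then stack the tidied results into one tibble:

```r
flights |>
  group_by(carrier) |>
  tidyr::nest() |>  # one row per carrier, data in a list column
  mutate(model  = purrr::map(data, ~ lm(arr_delay ~ distance, data = .x)),
         tidied = purrr::map(model, broom::tidy)) |>
  select(carrier, tidied) |>
  tidyr::unnest(tidied)
```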

Step 4 (continuing the analysis from above): Extract results from all models

models_broom <- purrr::map(models_list, broom::tidy)
models_broom[[1]] # inspect one list entry
## # A tibble: 2 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)    -5.90  0.0330       -179.       0
## 2 dep_delay       1.02  0.000786     1296.       0
# turn into a single data frame by row-binding all tidied results
models_broom_df <- dplyr::bind_rows(models_broom)

Step 5: Summarize results

models_broom_df |> 
  dplyr::filter(!str_detect(term, "Intercept|carrier")) |>
  ggplot(aes(estimate)) + 
  geom_histogram(binwidth = .1) + 
  geom_vline(xintercept = 0, linetype="dashed") + 
  facet_grid(cols = vars(term), scales = "free_y") + 
  theme_minimal()

Reporting model results

Tables

A table of models:

model1_out <- lm(arr_delay ~ dep_delay, data = flights)
model2_out <- lm(arr_delay ~ distance, data = flights)
model3_out <- lm(arr_delay ~ dep_delay + distance, data = flights)

models <- list(model1_out, model2_out, model3_out)

modelsummary::modelsummary(models, 
                           estimate = "{estimate} [{conf.low}, {conf.high}]",
                           statistic = NULL,
                           gof_omit = ".+",
                           title = "Linear regression of flight delay at arrival (in mins)") 
Linear regression of flight delay at arrival (in mins)

|             | Model 1                 | Model 2                 | Model 3                 |
|-------------|-------------------------|-------------------------|-------------------------|
| (Intercept) | −5.899 [−5.964, −5.835] | 10.829 [10.564, 11.095] | −3.213 [−3.322, −3.104] |
| dep_delay   | 1.019 [1.018, 1.021]    |                         | 1.018 [1.017, 1.020]    |
| distance    |                         | −0.004 [−0.004, −0.004] | −0.003 [−0.003, −0.002] |

Coefficient plots

One model, one plot:

cm <- c("distance" = "Distance",
        "originLGA" = "Origin: LGA",
        "originJFK" = "Origin: JFK")

modelsummary::modelplot(model_out, 
                        coef_omit = "Interc", 
                        coef_map = cm)

Standardized continuous variable:

flights$distance_std <- effectsize::standardize(flights$distance)
model_out_std2 <- lm(arr_delay ~ distance_std + origin, data = flights)

cm <- c("distance_std" = "Distance",
        "originLGA" = "Origin: LGA",
        "originJFK" = "Origin: JFK")

One model, one plot (more options):

modelsummary::modelplot(model_out_std2, coef_omit = "Interc", coef_map = cm) + 
  xlim(-5, .25) + 
  geom_vline(xintercept = 0, linetype="dashed") + 
  labs(title = "Linear regression of flight delay at arrival (in mins)",
       caption = "Data source: nycflights13 package") + 
  theme_minimal()

Re-scale continuous variable:

flights$distance1kmiles <- flights$distance / 1000
model_out_kmiles <- lm(arr_delay ~ distance1kmiles + origin, data = flights)
cm <- c("distance1kmiles" = "Distance (1k miles)",
        "originLGA" = "Origin: LGA",
        "originJFK" = "Origin: JFK")

One model, one plot (more options):

modelsummary::modelplot(model_out_kmiles, coef_omit = "Interc", coef_map = cm) + 
  xlim(-5, .25) + 
  geom_vline(xintercept = 0, linetype="dashed") + 
  labs(title = "Linear regression of flight delay at arrival (in mins)",
       caption = "Data source: nycflights13 package") + 
  theme_minimal()

Multiple models, one faceted plot:

modelsummary::modelplot(models, coef_omit = "Interc") + 
  facet_grid(~model)

library(effectsize)
#?standardize
#model_out_std <- standardize_parameters(model_out)
#modelplot(model_out_std, coef_omit = "Interc", coef_map = cm) + 
#  labs(title = "Linear regression of flight delay at arrival (in mins)",
#       caption = "Data source: nycflights13 package") + 
#  theme_minimal()

True-vs-fitted plots

model_out |> 
  broom::augment() |>
  dplyr::slice_sample(n = 10000) |> 
  ggplot(aes(x = .fitted, y = arr_delay)) +
  geom_point(alpha =  0.25) +
  geom_abline(intercept = 0, slope = 1) + 
  labs(title = "Fitted vs. true values, lm(arr_delay ~ distance + origin)",
       caption = "Data source: nycflights13 package") + 
  xlab("Fitted values") + 
  ylab("Arrival delay (in mins)") + 
  theme_minimal()

model_out_improve <- lm(arr_delay ~ dep_delay + distance + origin, data = flights)

model_out_improve |> 
  broom::augment() |>
  dplyr::slice_sample(n = 10000) |> 
  ggplot(aes(x = .fitted, y = arr_delay)) +
  geom_point(alpha =  0.25) +
  geom_abline(intercept = 0, slope = 1) + 
  labs(title = "Fitted vs. true values, lm(arr_delay ~ dep_delay + distance + origin)",
       caption = "Data source: nycflights13 package") + 
  xlab("Fitted values") + 
  ylab("Arrival delay (in mins)") + 
  theme_minimal()

Further reading


Acknowledgements

This tutorial drew heavily on the vignette from the specr package as well as the regression section in Data Science for Economists by Grant McDermott.

This script was drafted by Tom Arendt and Lisa Oswald, with contributions by Steve Kerr, Hiba Ahmad, Carmen Garro, and Sebastian Ramirez-Ruiz.